
Abstract. In this paper, we propose a first application of data mining
techniques to propositional satisfiability. Our proposed Mining4SAT approach
aims to discover and to exploit hidden structural knowledge for reducing the
size of propositional formulae in conjunctive normal form (CNF). Mining4SAT
combines both frequent itemset mining techniques and Tseitin's encoding for a
compact representation of CNF formulae. The experiments of our Mining4SAT
approach show interesting reductions of the sizes of many application instances
taken from the last SAT competitions.



1 Introduction 



Propositional satisfiability (SAT) has become a core technology in many
application domains, such as formal verification, planning and various new
applications derived from the recent impressive progress in practical SAT
solving. Propositional formulae in conjunctive normal form (CNF) are the
standard input format for propositional satisfiability. Such a convenient CNF
form is derived from a general Boolean formula using the well-known Tseitin
encoding [9]. Two important flaws of this encoding have been identified and
widely discussed in the literature. First, it is often argued that by encoding
arbitrary propositional formulae in CNF, structural properties of the original
problem are not reflected in the CNF formula. Secondly, even if such a
translation is linear in the size of the original formula, a huge CNF formula
might result when encoding real-world problems. Some instances exceed the
capacity of the available memory, and even if the instance can be stored, the
time needed for reading the input instance might be higher than its solving
time. To address this problem, developing a more compact representation is
clearly an interesting research issue. By compact encoding of formulae, we have
in mind a representation model which, through its use of structural properties,
results in the most compact possible formula.

Two promising models have been proposed in recent years. The first, proposed
by H. Dixon et al. [2], uses group theory to represent several classical clauses
by a single clause called an "augmented clause". The second model, proposed by
M. L. Ginsberg et al. [3] and called QPROP ("quantified propositional logic"),
may be seen as a propositional formula extended by the introduction of
quantifications over finite domains, i.e. first-order logic limited to finite
types and without function symbols. The difficulty lies in finding efficient
solving techniques for formulae encoded using such models.



More recently, an original approach for compacting sets of binary clauses was
proposed by J. Rintanen in [7]. Binary clauses are ubiquitous in propositional
formulae that represent real-world problems ranging from model-checking problems
in computer-aided verification to AI planning problems. In [7], using auxiliary
variables, it is shown how constraint graphs that contain big cliques or
bi-cliques of binary clauses can be represented more compactly than the
quadratic and explicit representation. The main limitation of this approach lies
in its restriction to particular sets of binary clauses whose constraint graph
represents cliques or bi-cliques. Such particular regularities can be caused by
the presence of an at-most-one constraint over a subset of variables, forbidding
more than one of them to be true at a time.

In the data mining community, several models and techniques for discovering
interesting patterns in large databases have been proposed in the last few
years. The problem of mining frequent itemsets is well-known and essential in
data mining, knowledge discovery and data analysis. Since the first article of
Agrawal [1] on association rules and itemset mining, the huge number of works,
challenges, datasets and projects shows the actual interest in this problem
(see [5] for a recent survey).

Our goal in this work is to address the problem of finding compact
representations of CNF formulae. Our proposed Mining4SAT approach aims to
discover hidden structures from arbitrary CNF formulae and to exploit them to
reduce the overall size of the CNF formula while preserving satisfiability.
Mining4SAT brings to SAT an exciting application domain, namely the data mining
task of finding frequent itemsets from 0-1 transaction databases.

Recently, a first constraint programming (CP) based data mining framework was
proposed by Luc De Raedt et al. in [6] for itemset mining. This new framework
offers a declarative and flexible representation model. It allows data mining
problems to benefit from several generic and efficient CP solving techniques.
This first study leads to the first CP approach for itemset mining, displaying
nice declarative opportunities while opening interesting perspectives for
cross-fertilization between data mining, constraint programming and
propositional satisfiability.

In this paper, we are particularly interested in the other side of this
innovative connection between these two research domains, namely how data mining
can be helpful for SAT. We present the first data-mining approach for Boolean
satisfiability. We show that itemset mining techniques are very suitable for
discovering interesting patterns in CNF formulae. Such patterns are then used
to rewrite the CNF formula more compactly. We also show how sets of binary
clauses can be compacted by our approach, and we prove that it automatically
achieves reductions similar to those of [7] on bi-cliques and cliques of binary
clauses. It is also important to note that our proposed Mining4SAT approach is
incremental. Indeed, our method can be applied incrementally or in parallel on
the subsets of any partition of the original CNF formula. This will be
particularly helpful for huge CNF formulae that cannot be entirely stored in
memory.

2 Frequent Itemset Mining Problem 

2.1 Preliminary Notations and Definitions 

Let $\mathcal{I}$ be a set of items. A set $I \subseteq \mathcal{I}$ is called
an itemset. A transaction is a couple $(tid, I)$ where $tid$ is the transaction
identifier and $I$ is an itemset. A transaction database $\mathcal{D}$ is a
finite set of transactions over $\mathcal{I}$ in which no two distinct
transactions have the same transaction identifier. We say that a transaction
$(tid, I)$ supports an itemset $J$ if $J \subseteq I$.

The cover of an itemset $I$ in a transaction database $\mathcal{D}$ is the set
of identifiers of transactions in $\mathcal{D}$ supporting $I$:
$\mathcal{C}(I, \mathcal{D}) = \{tid \mid (tid, J) \in \mathcal{D} \text{ and } I \subseteq J\}$.
The support of an itemset $I$ in $\mathcal{D}$ is defined by
$\mathcal{S}(I, \mathcal{D}) = |\mathcal{C}(I, \mathcal{D})|$. Moreover, the
frequency of $I$ in $\mathcal{D}$ is defined by
$\mathcal{F}(I, \mathcal{D}) = \frac{\mathcal{S}(I, \mathcal{D})}{|\mathcal{D}|}$.

For example, let us consider the transaction database in Table 1. Each
transaction corresponds to the favorite writers of a library member. For
instance, we have
$\mathcal{S}(\{Hemingway, Melville\}, \mathcal{D}) = |\{002, 004\}| = 2$ and
$\mathcal{F}(\{Hemingway, Melville\}, \mathcal{D}) = \frac{2}{6} = \frac{1}{3}$.



tid   itemset
001   Joyce, Beckett, Proust
002   Faulkner, Hemingway, Melville
003   Joyce, Proust
004   Hemingway, Melville
005   Flaubert, Zola
006   Hemingway, Golding

Table 1. An example of transaction database $\mathcal{D}$
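
As an illustration, the definitions of cover, support and frequency can be checked on Table 1 with a few lines of Python. This is only a minimal sketch: the names `DB`, `cover`, `support` and `frequency` are ours and do not come from any mining library.

```python
from fractions import Fraction

# Table 1 as a mapping from transaction identifier to itemset.
DB = {
    "001": {"Joyce", "Beckett", "Proust"},
    "002": {"Faulkner", "Hemingway", "Melville"},
    "003": {"Joyce", "Proust"},
    "004": {"Hemingway", "Melville"},
    "005": {"Flaubert", "Zola"},
    "006": {"Hemingway", "Golding"},
}

def cover(itemset, db):
    """Identifiers of the transactions that contain every item of `itemset`."""
    return {tid for tid, items in db.items() if set(itemset) <= items}

def support(itemset, db):
    """Size of the cover."""
    return len(cover(itemset, db))

def frequency(itemset, db):
    """Support divided by the number of transactions, as an exact fraction."""
    return Fraction(support(itemset, db), len(db))
```

With these helpers, `cover({"Hemingway", "Melville"}, DB)` is the set of identifiers 002 and 004, which gives the support 2 and the frequency 1/3 used in the example above.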



Let $\mathcal{D}$ be a transaction database over $\mathcal{I}$ and $\lambda$ a
minimal support threshold. The frequent itemset mining problem consists of
computing the following set:
$\mathcal{FIM}(\mathcal{D}, \lambda) = \{I \subseteq \mathcal{I} \mid \mathcal{S}(I, \mathcal{D}) \ge \lambda\}$.

The problem of computing the number of frequent itemsets is #P-hard [1].
The complexity class #P corresponds to the set of counting problems associated
with decision problems in NP. For example, counting the number of models
satisfying a CNF formula is a #P problem.



2.2 Maximal and Closed Frequent Itemsets 

Let us now define two condensed representations of the set of all frequent item- 
sets: maximal and closed frequent itemsets. 



Definition 1 (Maximal Frequent Itemset). Let $\mathcal{D}$ be a transaction
database, $\lambda$ a minimal support threshold and
$I \in \mathcal{FIM}(\mathcal{D}, \lambda)$. $I$ is called maximal when for all
$I' \supset I$, $I' \notin \mathcal{FIM}(\mathcal{D}, \lambda)$ ($I'$ is not a
frequent itemset).

We denote by $\mathcal{MAX}(\mathcal{D}, \lambda)$ the set of all maximal
frequent itemsets in $\mathcal{D}$ with $\lambda$ as a minimal support
threshold. For instance, in the previous example, we have
$\mathcal{MAX}(\mathcal{D}, 2) = \{\{Joyce, Proust\}, \{Hemingway, Melville\}\}$.

Definition 2 (Closed Frequent Itemset). Let $\mathcal{D}$ be a transaction
database, $\lambda$ a minimal support threshold and
$I \in \mathcal{FIM}(\mathcal{D}, \lambda)$. $I$ is called closed when for all
$I' \supset I$, $\mathcal{C}(I, \mathcal{D}) \ne \mathcal{C}(I', \mathcal{D})$.

We denote by $\mathcal{CLO}(\mathcal{D}, \lambda)$ the set of all closed
frequent itemsets in $\mathcal{D}$ with $\lambda$ as a minimal support
threshold. For instance, we have
$\mathcal{CLO}(\mathcal{D}, 2) = \{\{Hemingway\}, \{Joyce, Proust\}, \{Hemingway, Melville\}\}$.
In particular, let us note that we have
$\mathcal{C}(\{Hemingway\}, \mathcal{D}) = \{002, 004, 006\}$ and
$\mathcal{C}(\{Hemingway, Melville\}, \mathcal{D}) = \{002, 004\}$. That explains
why $\{Hemingway\}$ and $\{Hemingway, Melville\}$ are both closed. One can easily
see that if all the closed (resp. maximal) frequent itemsets are computed, then
all the frequent itemsets can be computed without using the corresponding
database. Indeed, the frequent itemsets correspond to all the subsets of the
closed (resp. maximal) frequent itemsets.
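
The two condensed representations can be recomputed on Table 1 by brute force. The sketch below is only meant to make Definitions 1 and 2 concrete: it enumerates every subset of the items, which is exponential and unsuitable for real databases, and all function names are ours.

```python
from itertools import combinations

# Table 1, keeping only the itemsets (identifiers are not needed here).
DB = [
    frozenset({"Joyce", "Beckett", "Proust"}),
    frozenset({"Faulkner", "Hemingway", "Melville"}),
    frozenset({"Joyce", "Proust"}),
    frozenset({"Hemingway", "Melville"}),
    frozenset({"Flaubert", "Zola"}),
    frozenset({"Hemingway", "Golding"}),
]

def support(itemset, db):
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(db, min_support):
    """Map every non-empty frequent itemset to its support (brute force)."""
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), db)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result

def maximal(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {i for i in freq if not any(i < j for j in freq)}

def closed(freq):
    """Frequent itemsets with no proper superset of equal support
    (equal support on a superset is the same as equal cover)."""
    return {i for i in freq if not any(i < j and freq[j] == freq[i] for j in freq)}

FREQ = frequent_itemsets(DB, 2)
```

Here `maximal(FREQ)` returns the two itemsets {Joyce, Proust} and {Hemingway, Melville}, and `closed(FREQ)` additionally contains {Hemingway}, in agreement with the sets given above.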

Clearly, the number of maximal (resp. closed) frequent itemsets is
significantly smaller than the number of frequent itemsets. Nonetheless, this
number is not always polynomial in the size of the database [11]. In particular,
the problem of counting the number of maximal frequent itemsets is #P-complete
(see also [11]).

Many algorithms have been proposed for enumerating frequent closed itemsets.
One can cite the Apriori-like algorithms, originally proposed in [1] for mining
frequent itemsets for association rules. Such an algorithm proceeds by a
level-wise search of the elements of $\mathcal{FIM}(\mathcal{D}, \lambda)$.
Indeed, it starts by computing the elements of
$\mathcal{FIM}(\mathcal{D}, \lambda)$ of size one. Then, assuming that the
elements of $\mathcal{FIM}(\mathcal{D}, \lambda)$ of size $n$ are known, it
computes a set of candidates of size $n + 1$ so that $I$ is a candidate if and
only if all its proper subsets are in $\mathcal{FIM}(\mathcal{D}, \lambda)$.
This procedure is iterated until no more candidates are found. Obviously, this
basic procedure is enhanced using some properties such as the anti-monotonicity
property, which allows us to reduce the search space. Indeed, if
$I \notin \mathcal{FIM}(\mathcal{D}, \lambda)$, then
$I' \notin \mathcal{FIM}(\mathcal{D}, \lambda)$ for all $I' \supseteq I$. In our
experiments, we consider one of the state-of-the-art algorithms, LCM, for mining
frequent closed itemsets, proposed by Takeaki Uno et al. in [10]. In theory, the
authors prove that LCM exactly enumerates the set of frequent closed itemsets
within polynomial time per closed itemset in the total input size. Let us
mention that the LCM algorithm obtained the best implementation award of
FIMI 2004 (Frequent Itemset Mining Implementations).
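
The level-wise search described above can be sketched as follows. This is a simplified Apriori-style enumeration written for illustration only; it is neither the original algorithm of [1] in full detail nor LCM, and the function name `apriori` is ours.

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise enumeration of the non-empty frequent itemsets.
    `db` is a list of frozensets (the transactions without identifiers)."""
    def support(itemset):
        return sum(1 for t in db if itemset <= t)

    items = sorted(set().union(*db))
    # Level 1: frequent itemsets of size one.
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = set(level)
    while level:
        size = len(next(iter(level))) + 1
        # Join step: unions of two frequent itemsets that have the next size.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step (anti-monotonicity): every subset of size - 1 must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
    return frequent

# Table 1 again, as a list of itemsets.
DB = [
    frozenset({"Joyce", "Beckett", "Proust"}),
    frozenset({"Faulkner", "Hemingway", "Melville"}),
    frozenset({"Joyce", "Proust"}),
    frozenset({"Hemingway", "Melville"}),
    frozenset({"Flaubert", "Zola"}),
    frozenset({"Hemingway", "Golding"}),
]
```

On Table 1 with threshold 2, the search stops after the second level, since no candidate of size three survives the join step.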

3 From CNF Formula to Transaction Database

We first introduce the satisfiability problem and some necessary notations. We
consider the conjunctive normal form (CNF) representation for the propositional
formulas. A CNF formula $\Phi$ is a conjunction of clauses, where a clause is a
disjunction of literals. A literal is a positive ($p$) or negated ($\neg p$)
propositional variable. The two literals $p$ and $\neg p$ are called
complementary. A CNF formula can also be seen as a set of clauses, and a clause
as a set of literals. The size of the CNF formula $\Phi$ is defined as
$|\Phi| = \sum_{c \in \Phi} |c|$, where $|c|$ is equal to the number of literals
in $c$. We denote by $\bar{l}$ the complementary literal of $l$. More precisely,
if $l = p$ then $\bar{l}$ is $\neg p$ and if $l = \neg p$ then $\bar{l}$ is $p$.
Let us recall that any propositional formula can be translated to CNF using
Tseitin's linear encoding [9]. We denote by $\mathcal{V}_\Phi$ the set of
propositional variables appearing in $\Phi$, while the set of literals of $\Phi$
is defined as $\mathcal{L}_\Phi = \bigcup_{x \in \mathcal{V}_\Phi} \{x, \neg x\}$.
An interpretation $\mathcal{B}$ of a propositional formula $\Phi$ is a function
which associates a value $\mathcal{B}(p) \in \{0, 1\}$ (0 corresponds to false
and 1 to true) to the variables $p \in \mathcal{V}_\Phi$. A model of a formula
$\Phi$ is an interpretation $\mathcal{B}$ that satisfies the formula:
$\mathcal{B}(\Phi) = 1$. The SAT problem consists in deciding if a given CNF
formula admits a model or not.

A CNF formula can be considered as a transaction database, called CNF 
database, where the items correspond to literals and the transactions to clauses. 
Complementary literals correspond to two different items. 

Definition 3 (CNF to $\mathcal{D}$). Let $\Phi = \bigwedge_{1 \le i \le n} c_i$
be a CNF formula. The set of items is $\mathcal{I} = \mathcal{L}_\Phi$ and the
transaction database associated to $\Phi$ is defined as
$\mathcal{D}_\Phi = \{(tid_i, c_i) \mid 1 \le i \le n\}$.

In this context, a frequent itemset corresponds to a frequent set of literals:
the number of clauses containing these literals is greater than or equal to the
minimal threshold. For instance, if we set the minimal threshold $\lambda$ to 2,
we get $\{x_1, \neg x_2\}$ as a frequent itemset in the previous database. The
set of maximal frequent itemsets is the smallest set of frequent sets of
literals where each frequent set of literals is included in at least one of its
elements. For instance, the unique maximal frequent itemset in the previous
example is $\{x_1, \neg x_2\}$ ($\lambda = 2$). Furthermore, the set of closed
frequent itemsets is the smallest set of frequent sets of literals where each
frequent itemset is included in at least one of its elements having the same
support. For instance, the set of the closed frequent itemsets is
$\{\{x_1, \neg x_2\}, \{x_1\}\}$.

In the definition of a transaction database, we did not require the set of
items in a transaction to be unique. Indeed, two different transactions can have
the same set of items and different identifiers. A CNF formula may contain the 
same clause more than once, but in practice this does not provide any information 
about satisfiability. Thus, we can consider a CNF database as just a set of 
itemsets (sets of literals). 
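
The construction of Definition 3 is straightforward to sketch in Python. Literals are written as DIMACS-style integers (a negative integer is a negated variable), and duplicate clauses are dropped as discussed above. The three-clause formula `PHI` is a hypothetical example of ours, chosen only so that $\{x_1, \neg x_2\}$ and $\{x_1\}$ behave as in the discussion; it is not taken from the paper.

```python
def cnf_to_database(clauses):
    """Map each distinct clause to a transaction: identifier -> set of literals."""
    unique = []
    for clause in clauses:
        c = frozenset(clause)
        if c not in unique:          # repeated clauses carry no information
            unique.append(c)
    return {tid: c for tid, c in enumerate(unique, start=1)}

def support(literals, db):
    """Number of clauses (transactions) containing all the given literals."""
    wanted = frozenset(literals)
    return sum(1 for c in db.values() if wanted <= c)

# Hypothetical formula: (x1 v -x2 v x3) ^ (x1 v -x2 v -x4) ^ (x1 v x5)
PHI = [[1, -2, 3], [1, -2, -4], [1, 5]]
DB = cnf_to_database(PHI)
```

With a threshold of 2, `support([1, -2], DB)` is 2 and `support([1], DB)` is 3, so both itemsets are frequent and closed, while only the first one is maximal. Note that complementary literals such as 2 and -2 are distinct items.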

4 Mining-based Approach for Size-Reduction of CNF 
Formulae 

In this section, we describe our mining-based approach, called Mining4SAT,
for reducing the size of CNF formulae. The key idea consists in searching for 
frequent sets of literals (sub-clauses) and substituting them with new variables 
using Tseitin's encoding [9]. 



4.1 Tseitin's Encoding 

Tseitin's encoding consists in introducing fresh variables to represent
sub-formulae in order to represent their truth values. Let us consider the
following DNF formula (Disjunctive Normal Form: a disjunction of conjunctions):

$(x_1 \wedge \cdots \wedge x_l) \vee (y_1 \wedge \cdots \wedge y_m) \vee (z_1 \wedge \cdots \wedge z_n)$

A naive way of converting such a formula to a CNF formula consists in using the
distributivity of disjunction over conjunction
($(A \vee (B \wedge C)) \leftrightarrow ((A \vee B) \wedge (A \vee C))$):

$(x_1 \vee y_1 \vee z_1) \wedge (x_1 \vee y_1 \vee z_2) \wedge \cdots \wedge (x_l \vee y_m \vee z_n)$

Such a naive approach is clearly exponential in the worst case. In Tseitin's
transformation, fresh propositional variables are introduced to prevent such
combinatorial explosion, mainly caused by the distributivity of disjunction over
conjunction and vice versa. With additional variables, the obtained CNF formula
is linear in the size of the original formula. However, the equivalence is only
preserved w.r.t. satisfiability:

$(t_1 \vee t_2 \vee t_3) \wedge (t_1 \rightarrow (x_1 \wedge \cdots \wedge x_l)) \wedge (t_2 \rightarrow (y_1 \wedge \cdots \wedge y_m)) \wedge (t_3 \rightarrow (z_1 \wedge \cdots \wedge z_n))$
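
To make the size difference concrete, the following sketch builds both the naive CNF and the encoding with fresh variables for a DNF given as a list of terms. It assumes the one-directional form (each fresh variable implies its term) and DIMACS-style integer literals; the function names and the brute-force `satisfiable` check are ours and only suitable for tiny formulas.

```python
from itertools import product

def naive_cnf(terms):
    """Distribute OR over AND: one clause per choice of one literal in each term."""
    return [frozenset(choice) for choice in product(*terms)]

def tseitin_cnf(terms, first_fresh):
    """One fresh variable t_j per term, with binary clauses encoding t_j -> term_j."""
    fresh = [first_fresh + j for j in range(len(terms))]
    clauses = [frozenset(fresh)]                              # t_1 v ... v t_r
    for t, term in zip(fresh, terms):
        clauses += [frozenset([-t, lit]) for lit in term]     # t -> lit
    return clauses

def satisfiable(clauses, n_vars):
    """Brute-force satisfiability test over variables 1..n_vars."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False
```

For three terms of three literals each, `naive_cnf` produces 27 ternary clauses, whereas `tseitin_cnf` produces 10 clauses (one ternary clause and nine binary ones). The two results are equisatisfiable but not equivalent, since the second one mentions the fresh variables.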

4.2 A Size-Reduction Method 

Let us consider the following CNF formula $\Phi$:

$(x_1 \vee \cdots \vee x_n \vee \alpha_1) \wedge \cdots \wedge (x_1 \vee \cdots \vee x_n \vee \alpha_k)$

where $n \ge 2$, $k > \frac{n+1}{n-1}$, $x_1, \ldots, x_n$ are literals and
$\alpha_1, \ldots, \alpha_k$ are clauses. The number of literals in this formula
can be reduced as follows:

$(y \vee \alpha_1) \wedge \cdots \wedge (y \vee \alpha_k) \wedge (x_1 \vee \cdots \vee x_n \vee \neg y)$

where $y$ is a fresh propositional variable. Indeed, $n \times k$ literals are
replaced with $k + n + 1$ literals. Clearly, the transformation preserves
satisfiability: every model of the formula obtained after reduction is a model
of $\Phi$, and every model of $\Phi$ can be extended to a model of the reduced
formula by giving $y$ the truth value of $x_1 \vee \cdots \vee x_n$. Now, if we
consider the CNF database corresponding to $\Phi$, $\{x_1, \ldots, x_n\}$ is an
itemset of support at least $k$, and hence a frequent itemset for any minimal
support threshold $\lambda \le k$.

It is easy to see that, to reduce the number of literals, $n$ must be greater
than or equal to 2. Indeed, if $n < 2$ then there is no reduction of the number
of literals; on the contrary, their number is increased. Regarding the value of
$k$, one can also see that such a transformation is interesting only when
$k > \frac{n+1}{n-1}$, since $n \times k > k + n + 1$ is equivalent to
$k(n - 1) > n + 1$. Thus, there are three cases: if $n = 2$ then $k \ge 4$, else
if $n = 3$ then $k \ge 3$, and $k \ge 2$ otherwise. Therefore, the number of
literals is always reduced when $k \ge 4$.
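
The transformation and the threshold on $k$ can be checked with a small sketch. Clauses are sets of DIMACS-style integer literals; `reduce_with_itemset`, `literals_saved` and `min_support_for_gain` are names introduced here for illustration.

```python
def size(clauses):
    """Total number of literal occurrences."""
    return sum(len(c) for c in clauses)

def reduce_with_itemset(clauses, itemset, fresh):
    """Replace `itemset` by the variable `fresh` in every clause containing it,
    then add the clause (itemset v -fresh)."""
    itemset = frozenset(itemset)
    reduced = []
    for c in map(frozenset, clauses):
        reduced.append((c - itemset) | {fresh} if itemset <= c else c)
    reduced.append(itemset | {-fresh})
    return reduced

def literals_saved(n, k):
    """n * k literal occurrences are replaced by k + n + 1."""
    return n * k - (k + n + 1)

def min_support_for_gain(n):
    """Smallest k with a strictly positive gain, i.e. k > (n + 1) / (n - 1)."""
    if n < 2:
        raise ValueError("no gain is possible for n < 2")
    k = 1
    while literals_saved(n, k) <= 0:
        k += 1
    return k
```

For example, with the four clauses $(x_1 \vee x_2 \vee a_i)$, $1 \le i \le 4$, substituting $\{x_1, x_2\}$ reduces the size from 12 to 11 literals, which is the gain `literals_saved(2, 4)`. The function `min_support_for_gain` returns 4, 3 and 2 for $n = 2$, $n = 3$ and $n \ge 4$ respectively, matching the three cases above.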

In the previous example, we illustrated how the problem of finding frequent
itemsets can be used to reduce the size of a CNF formula. One can see that, in
general, it is more interesting to consider a condensed representation of the
frequent itemsets (closed or maximal) to reduce the number of literals. Indeed,
by using a condensed representation, we still cover all the frequent itemsets,
and the number of fresh propositional variables and new clauses (in our example,
$y$ and $(x_1 \vee \cdots \vee x_n \vee \neg y)$) introduced is smaller than the
number introduced by using all the frequent itemsets. For instance, in the
previous formula, it is not interesting to introduce a fresh propositional
variable for each subset of $\{x_1, \ldots, x_n\}$.

Closed vs. Maximal. In Section 2.2, we introduced two condensed representations
of the frequent itemsets: closed and maximal. The question is, which condensed
representation is better? We know that the set of maximal frequent itemsets is
included in that of the closed ones. Thus, fewer fresh variables and new clauses
are introduced using the maximal frequent itemsets. However, there are cases
where the use of the closed frequent itemsets is more suitable. For example, let
us consider the following formula:

$(x_1 \vee \ldots \vee x_k \vee \ldots \vee x_n \vee \alpha_1) \wedge \cdots \wedge (x_1 \vee \ldots \vee x_k \vee \ldots \vee x_n \vee \alpha_m) \wedge$
$(x_1 \vee \ldots \vee x_k \vee \beta_1) \wedge \cdots \wedge (x_1 \vee \ldots \vee x_k \vee \beta_{m'})$

where $k \ge 2$, $m, m' \ge 4$ and $n > k$. We assume that the frequent itemsets
are only the subsets of $\{x_1, \ldots, x_n\}$. Therefore, $\{x_1, \ldots, x_n\}$
is the unique maximal itemset and the closed itemsets are $\{x_1, \ldots, x_n\}$
and $\{x_1, \ldots, x_k\}$. Let us start by using the closed frequent itemset
$\{x_1, \ldots, x_n\}$ in the reduction of the number of literals:

$(y \vee \alpha_1) \wedge \cdots \wedge (y \vee \alpha_m) \wedge$
$(x_1 \vee \ldots \vee x_k \vee \beta_1) \wedge \cdots \wedge (x_1 \vee \ldots \vee x_k \vee \beta_{m'}) \wedge$
$(x_1 \vee \ldots \vee x_n \vee \neg y)$

Now, by using $\{x_1, \ldots, x_k\}$, we get the following formula:

$(y \vee \alpha_1) \wedge \cdots \wedge (y \vee \alpha_m) \wedge$
$(z \vee \beta_1) \wedge \cdots \wedge (z \vee \beta_{m'}) \wedge$
$(z \vee x_{k+1} \vee \ldots \vee x_n \vee \neg y) \wedge (x_1 \vee \ldots \vee x_k \vee \neg z)$

In this example, it is clearly more interesting to consider the closed frequent
itemsets in our Mining4SAT approach.
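
The gain of each step in this example can be made explicit. Writing $g(s, j) = s(j - 1) - (j + 1)$ for the number of literals saved by substituting an itemset of size $j$ and support $s$ (from the count $n \times k$ versus $k + n + 1$ of Section 4.2), a short worked computation under the assumptions of the example gives:

```latex
\begin{align*}
\text{step 1 } (\{x_1,\ldots,x_n\},\ \text{support } m):\quad
  & g(m, n) = m(n-1) - (n+1), \\
\text{step 2 } (\{x_1,\ldots,x_k\},\ \text{support } m'+1):\quad
  & g(m'+1, k) = (m'+1)(k-1) - (k+1) \\
  & \phantom{g(m'+1, k)} \ge 5(k-1) - (k+1) = 4k - 6 \ge 2 .
\end{align*}
```

The support $m' + 1$ in the second step counts the $m'$ clauses containing some $\beta_j$ plus the clause $(x_1 \vee \ldots \vee x_n \vee \neg y)$ added by the first step. Since $k \ge 2$ and $m' \ge 4$, the second step always saves at least two further literals, which the maximal itemset alone would miss.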

In fact, a (closed) frequent itemset $I$ and one of its subsets $I'$ (which can
be closed) are both interesting if
$\mathcal{S}(I') - \mathcal{S}(I) > \frac{|I'|+1}{|I'|-1} - 1$. Indeed, if we
apply our transformation using $I$, then the support of $I'$ in the resulting
formula is equal to $\mathcal{S}(I') - \mathcal{S}(I) + 1$, and we know that $I'$
is interesting in the resulting formula if its support is greater than
$\frac{|I'|+1}{|I'|-1}$.

Overlap. Let $\Phi$ be a set of itemsets. Two itemsets $I$ and $I'$ of $\Phi$
overlap if $I \cap I' \ne \emptyset$. Moreover, $I$ and $I'$ are in the same
overlap class if there exist $k$ itemsets $I_1, \ldots, I_k$ of $\Phi$ such that
$I = I_1$, $I_k = I'$ and for all $1 \le i \le k - 1$, $I_i$ and $I_{i+1}$
overlap.

In our transformation, one can have some problems when two frequent itemsets
overlap. For example, if $\{x_1, x_2, x_3\}$ and $\{x_2, x_3, x_4\}$ are two
frequent itemsets (3 is the minimal support threshold) such that
$\mathcal{S}(\{x_1, x_2, x_3\}) = 3$, $\mathcal{S}(\{x_2, x_3, x_4\}) = 3$ and
$\mathcal{S}(\{x_1, x_2, x_3, x_4\}) = 2$, then if we apply our transformation
using $\{x_1, x_2, x_3\}$, the support of $\{x_2, x_3, x_4\}$ is equal to 2
(infrequent) in the resulting formula, and vice versa. Thus, we cannot use both
of them in the transformation.

Let us note that the overlap notion can be seen as a generalization of the
subset one. Let $I$ and $I'$ be frequent itemsets such that they overlap. They
are both interesting in our transformation if:

1. $\mathcal{S}(I) - \mathcal{S}(I \cup I') > \frac{|I|+1}{|I|-1} - 1$ or
   $\mathcal{S}(I') - \mathcal{S}(I \cup I') > \frac{|I'|+1}{|I'|-1} - 1$. This
   comes from the fact that if we apply the transformation using $I$ (resp.
   $I'$), then the support of $I'$ (resp. $I$) is equal to
   $\mathcal{S}(I') - \mathcal{S}(I \cup I') + 1$ (resp.
   $\mathcal{S}(I) - \mathcal{S}(I \cup I') + 1$).

2. $|I \setminus I'| \ge k$ (resp. $|I' \setminus I| \ge k$) where $k = 2$ if
   $\mathcal{S}(I) \ge 4$ (resp. $\mathcal{S}(I') \ge 4$), $k = 3$ if
   $\mathcal{S}(I) = 3$ (resp. $\mathcal{S}(I') = 3$), $k = 4$ otherwise. Indeed,
   in the previous cases, $I \setminus I'$ (resp. $I' \setminus I$) can be used
   in our transformation.

Mining4SAT algorithm. We now describe our Mining4SAT algorithm using the set of
closed frequent itemsets. Let us note that the optimal transformation using the
set of all the closed frequent itemsets can be obtained by an optimal
transformation using separately the overlap classes of this set. Actually, since
any two distinct overlap classes do not share any literal, the reduction applied
to a given formula using the elements of an overlap class does not affect the
supports of the elements of the other classes. Moreover, one can easily compute
the set of all the overlap classes of the set of the closed frequent itemsets:
let $G = (V, E)$ be an undirected graph such that $V$ is the set of the closed
frequent itemsets and $(I_1, I_2)$ is an edge of $G$ if and only if $I_1$ and
$I_2$ overlap; $C$ is an overlap class if and only if it corresponds to the set
of vertices of a connected component of $G$ which is not included in any other
connected component of $G$. For this reason, we restrict here our attention to
the reductions that can be obtained using a single overlap class. The whole
size-reduction process can be performed by iterating over all the overlap
classes.
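
The computation of the overlap classes as connected components can be sketched with a simple graph traversal. The quadratic neighbour search below is an assumption made for brevity, not a claim about the implementation used in the experiments, and the name `overlap_classes` is ours.

```python
def overlap_classes(itemsets):
    """Group itemsets into overlap classes, i.e. the connected components of
    the graph whose edges link itemsets sharing at least one item."""
    itemsets = [frozenset(i) for i in itemsets]
    unvisited = set(range(len(itemsets)))
    classes = []
    while unvisited:
        start = unvisited.pop()
        stack = [start]
        component = {start}
        while stack:
            u = stack.pop()
            # Unvisited itemsets that share an item with itemset u.
            neighbours = {v for v in unvisited if itemsets[u] & itemsets[v]}
            unvisited -= neighbours
            component |= neighbours
            stack.extend(neighbours)
        classes.append({itemsets[i] for i in component})
    return classes
```

For instance, the itemsets {1, 2, 3}, {2, 3, 4}, {5, 6}, {6, 7} and {8, 9} fall into three classes: the first two together, the next two together, and the last one alone.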

Let $I$ be a closed frequent itemset. We denote by $\alpha(I)$ the value
$\mathcal{S}(I) \times (|I| - 1) - |I| - 1$ that corresponds to the number of
literals saved by applying our transformation with $I$ on a CNF formula.

Algorithm 1 takes as input a CNF formula $\phi$ and an overlap class $C$, and
returns $\phi$ after applying size-reduction transformations. It iterates until
there is no element left in $C$. In each iteration, it first selects one of the
most interesting elements in $C$ (line 2): an element $I$ of $C$ such that there
is no element $I' \in C$ satisfying $\alpha(I') > \alpha(I)$. Note that this
element is not necessarily unique in $C$. This instruction means that
Algorithm 1 is a greedy algorithm, because it makes the locally optimal choice
at each iteration. Then, it applies our transformation using
$I = \{y_1, \ldots, y_n\}$: it replaces the occurrences of $I$ with a fresh
propositional variable $x$ (line 3), and it adds the clause
$y_1 \vee \ldots \vee y_n \vee \neg x$ to $\phi$ (line 4).

Algorithm 1 Size Reduction

Require: A formula $\phi$, an overlap class of closed frequent itemsets $C$

 1: while $C \ne \emptyset$ do
 2:   $I \leftarrow$ MostInterestingElement($C$);
 3:   replace($\phi$, $I$, $x$);
 4:   Add($\phi$, $I$, $x$);
 5:   remove($C$, $I$);
 6:   replaceSubset($C$, $I$, $x$);
 7:   removeUninterestingElements($C$);
 8:   updateSupports($C$);
 9: end while
10: return $\phi$

The algorithm next removes $I$ from $C$ (line 5) and replaces $I$ in the other
elements of $C$ with $x$ (line 6). The next instruction (line 7) consists in
removing the elements of $C$ that could increase the number of literals: the
elements that overlap with $I$ and are not included in $I$. As explained before,
an element of $C$ overlapping with $I$ does not necessarily increase the number
of literals. Thus, by removing elements from $C$ only because they overlap with
$I$, our algorithm can remove closed frequent itemsets that would decrease the
number of literals. A partial solution to this problem consists in recomputing
the closed frequent itemsets in the formula returned by Algorithm 1. The last
instruction in the while loop (line 8) consists in updating the supports of the
elements remaining in $C$ following the new value of $\phi$: the support of an
element $I'$ remaining in $C$ changes only when it is included in $I$, and its
new support is equal to $\mathcal{S}(I') - \mathcal{S}(I) + 1$. This instruction
also removes all the elements of $C$ becoming uninteresting because of the new
supports and sizes.
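
A minimal Python sketch of Algorithm 1 is given below, under simplifying assumptions of ours: clauses and itemsets are sets of DIMACS-style integer literals, supports are recomputed from the current clauses instead of being updated incrementally (which plays the role of line 8), and itemsets whose gain is not strictly positive are discarded. The names `size_reduction`, `alpha` and `support` are illustrative only.

```python
def support(itemset, clauses):
    """Number of clauses that contain every literal of `itemset`."""
    return sum(1 for c in clauses if itemset <= c)

def alpha(itemset, clauses):
    """Literals saved by substituting `itemset`: S(I) * (|I| - 1) - |I| - 1."""
    return support(itemset, clauses) * (len(itemset) - 1) - len(itemset) - 1

def size(clauses):
    return sum(len(c) for c in clauses)

def size_reduction(clauses, overlap_class, next_fresh):
    """Greedy size reduction over one overlap class; returns the new clauses.
    `next_fresh` is the first unused variable index."""
    clauses = [frozenset(c) for c in clauses]
    pending = {frozenset(i) for i in overlap_class}
    pending = {i for i in pending if alpha(i, clauses) > 0}
    while pending:
        best = max(pending, key=lambda i: alpha(i, clauses))          # line 2
        x, next_fresh = next_fresh, next_fresh + 1
        clauses = [(c - best) | {x} if best <= c else c for c in clauses]  # line 3
        clauses.append(best | {-x})                                   # line 4
        pending.discard(best)                                         # line 5
        kept = set()
        for other in pending:
            if best <= other:
                kept.add((other - best) | {x})   # line 6: substitute x for `best`
            elif other <= best or not (other & best):
                kept.add(other)                  # subsets and disjoint itemsets stay
            # line 7: itemsets that only partially overlap `best` are dropped
        # line 8: gains are re-evaluated on the current clauses
        pending = {i for i in kept if alpha(i, clauses) > 0}
    return clauses
```

As a usage example, take the formula of the "Closed vs. Maximal" discussion with $n = 3$, $k = 2$ and $m = m' = 4$, i.e. four clauses $(x_1 \vee x_2 \vee x_3 \vee a_i)$ and four clauses $(x_1 \vee x_2 \vee b_j)$, for a total of 28 literals. Running `size_reduction` with the overlap class containing $\{x_1, x_2, x_3\}$ and $\{x_1, x_2\}$ yields a formula of 22 literals. In this sketch the greedy choice picks $\{x_1, x_2\}$ first, because its gain (5) exceeds that of $\{x_1, x_2, x_3\}$ (4), and then substitutes the rewritten superset.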



5 Application: A Compact Representation of Sets of
Binary Clauses



Binary clauses (2-CNF formulae) are ubiquitous in CNF formulae encoding
real-world problems. Some of them contain more than 50% of binary clauses.
However, in our size reduction approach, binary clauses are not taken into
account. Indeed, to reduce the size of the formula, we only search for itemsets
of size at least two literals. The extremely rare case where a binary clause
representing a closed frequent itemset can be considered is when it appears at
least four times in the formula, i.e. it subsumes at least 4 clauses. In this
section, we first show how our mining-based approach can be used to achieve a
compact representation of arbitrary sets of binary clauses. Then, we consider
two interesting special cases corresponding to sets of binary clauses
representing either a clique or a bi-clique.



5.1 Compacting arbitrary sets of binary clauses

In order to reduce the size of the set of binary clauses, we only need to rewrite
the formula and to slightly modify Algorithm 1.

Definition 4 (B-implications). Let
$\Phi = \bigwedge_{1 \le i \le n} [(x_i \vee y^i_1) \wedge (x_i \vee y^i_2) \wedge \cdots \wedge (x_i \vee y^i_{k_i})]$
be a 2-CNF formula. We define
$B_{\vee[\wedge]}(\Phi) = \bigwedge_{1 \le i \le n} x_i \vee \beta_i$, where
$\beta_i = (y^i_1 \wedge y^i_2 \wedge \cdots \wedge y^i_{k_i})$. We call
$(x_i \vee \beta_i)$ a B-implication.

Obviously, the formulae $\Phi$ and $B_{\vee[\wedge]}(\Phi)$ are equivalent, and
there exist several ways to rewrite $\Phi$ as a conjunction of B-implications.

Example 1. Let $\Phi = (a \vee b) \wedge (a \vee c) \wedge (c \vee d)$ be a
2-CNF formula. We can rewrite $\Phi$ as
$B_{\vee[\wedge]}(\Phi) = (a \vee [b \wedge c]) \wedge (c \vee [d])$ or as
$B_{\vee[\wedge]}(\Phi) = (a \vee [b]) \wedge (c \vee [a \wedge d])$.

In the sequel, we use a lexicographic ordering on literals of $\Phi$. In
Example 1, we obtain the first rewriting,
$(a \vee [b \wedge c]) \wedge (c \vee [d])$, using the lexicographic ordering
$\neg a < a < \neg b < b < \neg c < c < \neg d < d$.

Definition 5 (2-CNF to $\mathcal{D}$). Let $\Phi$ be a 2-CNF formula and
$B_{\vee[\wedge]}(\Phi) = \bigwedge_{1 \le i \le n} x_i \vee \beta_i$. The
transaction database associated to $\Phi$ is defined as
$\mathcal{D}_\Phi = \{(tid_i, \beta_i) \mid x_i \vee \beta_i \in B_{\vee[\wedge]}(\Phi)\}$.

Let us now describe our approach to compact a 2-CNF formula $\Phi$, called
CNF2RED (for reducing the size of sets of binary clauses). First, after
rewriting $\Phi$ as $B_{\vee[\wedge]}(\Phi)$, we build the transaction database
$\mathcal{D}_\Phi$. The set ClSet of closed frequent itemsets and its associated
overlap classes Oclass are computed. The last step aims to reduce the size of
the 2-CNF $\Phi$ using a slightly modified version of Algorithm 1. First,
Algorithm 1 takes as input a formula $\phi = B_{\vee[\wedge]}(\Phi)$ and returns
$\phi$ after reducing its size. Secondly, for an itemset
$I = \{y_1, y_2, \ldots, y_n\}$, in line 4 of Algorithm 1 we introduce a fresh
variable $x$ and we add a B-implication
$(\neg x \vee [y_1 \wedge y_2 \wedge \cdots \wedge y_n])$ to $\phi$.
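
The rewriting of a 2-CNF formula into B-implications under the lexicographic ordering, and the resulting transaction database, can be sketched as follows. Literals are DIMACS-style integers, the variable index plays the role of the alphabetical order, and the left-hand literal $x_i$ is used as the transaction identifier; these choices and the function names are assumptions of this sketch.

```python
def lit_key(lit):
    """Sort key realising the ordering -a < a < -b < b < ... on integer literals."""
    return (abs(lit), lit > 0)

def b_implications(binary_clauses):
    """Attach each binary clause to its smallest literal x, giving a mapping
    x -> {y_1, ..., y_k} that stands for the B-implication x v [y_1 ^ ... ^ y_k]."""
    groups = {}
    for a, b in binary_clauses:
        x, y = sorted((a, b), key=lit_key)
        groups.setdefault(x, set()).add(y)
    return groups

def two_cnf_database(binary_clauses):
    """Transaction database of Definition 5: one transaction beta_i per x_i."""
    return {x: frozenset(ys) for x, ys in b_implications(binary_clauses).items()}
```

With $a, b, c, d$ numbered 1 to 4, the formula of Example 1 is `[(1, 2), (1, 3), (3, 4)]`, and `b_implications` returns `{1: {2, 3}, 3: {4}}`, i.e. $(a \vee [b \wedge c]) \wedge (c \vee [d])$. The order of the two literals inside a clause does not matter.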

5.2 Special case of (bi-)clique of binary clauses 

In [7], J. Rintanen addressed the problem of representing big sets of binary
clauses compactly. He shows in particular that constraint graphs arising from
practically interesting applications (e.g. AI planning) contain big cliques or
bi-cliques of binary clauses. An identified bi-clique involving the two sets of
literals $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$
expresses the propositional formula
$(x_1 \wedge x_2 \wedge \cdots \wedge x_n) \vee (y_1 \wedge y_2 \wedge \cdots \wedge y_m)$,
while a clique involving the literals $X = \{x_1, x_2, \ldots, x_n\}$ expresses
that at most one literal from $X$ is false.

Bi-clique of binary clauses. Let us explain how a bi-clique can be compacted
with the CNF2RED method. Let
$\Phi = [(x_1 \vee y_1) \wedge (x_1 \vee y_2) \wedge \cdots \wedge (x_1 \vee y_m)] \wedge \cdots \wedge [(x_n \vee y_1) \wedge (x_n \vee y_2) \wedge \cdots \wedge (x_n \vee y_m)]$
be a bi-clique of $n \times m$ binary clauses. Considering the lexicographic
ordering, $B_{\vee[\wedge]}(\Phi)$ corresponds exactly to
$\bigwedge_{1 \le i \le n} (x_i \vee [y_1 \wedge y_2 \wedge \cdots \wedge y_m])$.
Obviously, the transaction database $\mathcal{D}_\Phi$ contains a single closed
frequent itemset $\{y_1, y_2, \ldots, y_m\}$. Applying our algorithm leads to the
following compact representation of $\Phi$:
$\Phi' = [\bigwedge_{1 \le i \le n} (x_i \vee z)] \wedge [\bigwedge_{1 \le j \le m} (\neg z \vee y_j)]$,
where $z$ is the fresh variable. We obtain exactly the same gain as in [7]
($O(n + m)$ binary clauses and one additional variable).
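
The bi-clique case is small enough to be verified exhaustively. The sketch below builds both representations and compares their models projected on the original variables; the helper names are ours and the brute-force enumeration is only practical for a handful of variables.

```python
from itertools import product

def biclique(xs, ys):
    """The n * m binary clauses (x_i v y_j)."""
    return [frozenset([x, y]) for x in xs for y in ys]

def compact_biclique(xs, ys, z):
    """The n + m binary clauses (x_i v z) and (-z v y_j), with z fresh."""
    return [frozenset([x, z]) for x in xs] + [frozenset([-z, y]) for y in ys]

def models(clauses, variables):
    """Satisfying assignments, projected on `variables` (brute force)."""
    all_vars = sorted({abs(l) for c in clauses for l in c} | set(variables))
    found = set()
    for bits in product([False, True], repeat=len(all_vars)):
        val = dict(zip(all_vars, bits))
        if all(any(val[abs(l)] == (l > 0) for l in c) for c in clauses):
            found.add(tuple(val[v] for v in variables))
    return found
```

With three literals $x_i$ (variables 1 to 3), four literals $y_j$ (variables 4 to 7) and the fresh variable 8, the explicit bi-clique has 12 clauses and the compact one has 7, and both have the same models once the fresh variable is projected away.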



Clique of binary clauses. Let
$\Phi = \bigwedge_{1 \le i \le n-1} [(x_i \vee x_{i+1}) \wedge (x_i \vee x_{i+2}) \wedge \cdots \wedge (x_i \vee x_n)]$
be a clique of $n(n-1)/2$ binary clauses. We have
$B_{\vee[\wedge]}(\Phi) = \bigwedge_{1 \le i \le n-1} (x_i \vee [x_{i+1} \wedge x_{i+2} \wedge \cdots \wedge x_n])$.
If we take a closer look at $\mathcal{D}_\Phi$, the closed frequent itemset $I$
with the greatest value $\alpha(I)$ corresponds to $\{x_{n/2}, \ldots, x_n\}$. In
the first $\frac{n}{2}$ rows of $\mathcal{D}_\Phi$, $I$ is substituted by a fresh
variable $x$ and a new set of binary clauses
$(\neg x \vee [x_{n/2} \wedge \cdots \wedge x_n])$ is added to it, leading to two
subproblems of size $\frac{n}{2} + 1$. Obviously, the same treatment is done on
the formula $B_{\vee[\wedge]}(\Phi)$. Consequently, the number of variables is
defined by the following recurrence equation:
$V(n) = 2V(\frac{n}{2} + 1) + 1$, $V(6) = 1$. The base case is reached for
$n = 6$, where the last fresh variable is introduced to represent the
conjunction $x_4 \wedge x_5 \wedge x_6$. For $n < 6$ no fresh variable is
introduced because no frequent closed itemset can lead to a reduction of the
size of the formula. Consequently, from the solution of the previous recurrence
equation, we obtain that our encoding uses $O(n)$ auxiliary variables. Using the
same reasoning, we also obtain the same complexity $O(n)$ for the number of
binary clauses. This corresponds to the complexity obtained in [7].
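
The linear bound follows from the recurrence by a change of variable. The short derivation below assumes, for simplicity, that $n - 2$ is four times a power of two, so that every halving is exact:

```latex
\begin{align*}
W(m) &:= V(m + 2)
  &&\Rightarrow\quad W(m) = 2\,V\!\left(\tfrac{m}{2} + 2\right) + 1 = 2\,W\!\left(\tfrac{m}{2}\right) + 1,
  \qquad W(4) = V(6) = 1, \\
W(4 \cdot 2^{j}) &= 2^{\,j+1} - 1 = \tfrac{m}{2} - 1
  &&\text{(by induction on } j\text{, since } 2(2^{\,j+1} - 1) + 1 = 2^{\,j+2} - 1\text{)}, \\
V(n) &= W(n - 2) = \tfrac{n}{2} - 2
  &&\text{for } n = 4 \cdot 2^{j} + 2 .
\end{align*}
```

For instance, $V(6) = 1$, $V(10) = 2V(6) + 1 = 3$ and $V(18) = 2V(10) + 1 = 7$, in agreement with $\frac{n}{2} - 2$, which is indeed in $O(n)$.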

The two special cases of clique and bi-clique of binary clauses considered in
this section allow us to show that when a constraint is not well encoded, our
approach can be used to automatically derive a more efficient and compact
encoding.



6 Experiments 



Instance                        orig. form. size   red. form. size   % rmv
1dlx_c_iq57_a                   190 MB             164 MB            12.47 %
6pipe_6.ooo.*-as.sat03-413      11 MB              7.7 MB            19.64 %
9dlx.vliw.at.b.iq6.*-*04-347    76 MB              65 MB             14.02 %
abb313GPIA-9-c.*.sat04-317      21 MB              6.9 MB            63.92 %
E05F18                          3.7 MB             2.2 MB            43.48 %
eq.atree.braun.11.unsat         120 KB             72 KB             27.93 %
eq.atree.braun.12.unsat         144 KB             88 KB             27.66 %
k2mul.miter.*-as.sat03-355      1.5 MB             1.3 MB            11.27 %
korf-15                         1.2 MB             752 KB            34.17 %
rbcl_xits.08.UNSAT              1.1 MB             856 KB            16.42 %
SAT.dat.k45                     3.5 MB             2.6 MB            24.53 %
traffic_b_unsat                 18 MB              12 MB             26.53 %
x1mul.miter.*-as.sat03-359      1.1 MB             928 KB            12.68 %
9dlx_vliw_at_b_iq3              19 MB              15 MB             17.84 %
9dlx_vliw_at_b_iq4              31 MB              26 MB             18.02 %
AProVE07-09                     2.8 MB             2.7 MB            4.51 %
eq.atree.braun.10.unsat         96 KB              56 KB             28.30 %
goldb-heqc-frg1mul              348 KB             328 KB            12.66 %
goldb-heqc-x1mul                964 KB             896 KB            12.68 %
minand128                       7.7 MB             2.6 MB            65.28 %
ndhf_xits.09.UNSAT              2.6 MB             2.1 MB            18.61 %
rbcl_xits.07.UNSAT              868 KB             720 KB            16.49 %
velev-pipe-o-uns-1.1-6          5.5 MB             4.4 MB            18.89 %

Table 2. Results of Mining4SAT: a general approach



In this section, we present an experimental evaluation of our proposed
approaches. Two kinds of experiments have been conducted. The first one deals
with size reduction of arbitrary CNF formulas using the Mining4SAT algorithm,
while the second one attempts to reduce the size of the 2-CNF sub-formulas only,
using the CNF2RED algorithm.

Both algorithms are tested on different benchmarks taken from the SAT
Challenge 2012. From the 600 instances of the application category submitted to
this challenge, we selected 100 instances while taking at least one instance from 
each family. All tests were made on a Xeon 3.2GHz (2 GB RAM) cluster and 
the timeout was set to 4 hours. 

In Tables 2 and 3, the field size indicates the size in bytes of each SAT
instance before and after reduction. We also provide % rmv, the percentage of
removed literals. To study the influence of our size reduction approaches on
the solving time, we also ran the SAT solver MiniSAT 2.2 on both the original
instances and those obtained after reduction. Due to a lack of space, we only
present a sample of the whole results. Our goal is to provide some insights
about the general behavior of our reduction techniques.

Table 2 highlights the results obtained by the general Mining4SAT approach. In
these experiments, and to allow possible reductions, we only search for frequent
closed itemsets of size greater than or equal to 4. Consequently, binary clauses
are not considered. As we can observe, our Mining4SAT reduction approach allows
us to reduce the size by more than 20% on the majority of instances. Let us also
note that the maximum (65.28 %) is reached in the case of the instance
minand128: according to Table 2, its original size is 7.7 MB and its size after
reduction is 2.6 MB. For the SAT solving time, the results depend on the
instances. On some instances we can observe real improvements, whereas on others
the performance becomes worse.

In Table 3, we present a sample of the results obtained by the CNF2RED
algorithm on compacting only binary clauses. We observe similar behavior as in
the first experiment in terms of size reduction. However, we observe in general
some improvements in terms of SAT solving time.



Instance                                             orig. form. size   red. form. size   % rmv
velev-pipe-o-uns-1.1-6                               5.5 MB             3.2 MB            43.23 %
9dlx_vliw_at_b_iq2                                   11 MB              6 MB              42.56 %
1dlx_c_iq57_a                                        190 MB             124 MB            36.52 %
7pipe_k                                              14 MB              5.4 MB            59.66 %
SAT.dat.k100.debugged                                16 MB              13 MB             24.89 %
IBM.FV.2004.rule.batch.2.31.1.SAT.dat.k80.debugged   9.7 MB             7.5 MB            25.56 %
sokoban-sequential-p145-*.040-*                      24 MB              14 MB             45.16 %
openstacks-*-p30.1.085-*                             30 MB              26 MB             17.25 %
aaai10-planning-ipc5-*-12-step16                     17 MB              12 MB             35.35 %
k2fix_gr_rcs_w8.shuffled                             3.4 MB             1.7 MB            54.83 %
homer17.shuffled                                     20 KB              16 KB             39.86 %
gripper13u.shuffled-as.sat03-395                     524 KB             364 KB            35.03 %
grid-strips-grid-y-3.045-*                           52 MB              42 MB             23.48 %

Table 3. Results of CNF2RED: a 2-CNF approach



7 Conclusion and Future Work

In this paper, we propose the first data-mining approach, called Mining4SAT, for
reducing the size of Boolean formulae in conjunctive normal form (CNF). It can
be seen as a preprocessing step that aims to discover hidden structural
knowledge that is used to decrease the number of literals. Mining4SAT combines
both frequent itemset mining techniques, for discovering interesting
substructures, and a Tseitin-based approach, for a compact representation of CNF
formulae using these substructures. Thus, we show in this work, inter alia, that
frequent itemset mining techniques are very suitable for discovering interesting
patterns in CNF formulae.

Since we use a greedy algorithm in our approach, the formula obtained after
transformation is not guaranteed to be optimal w.r.t. size. An important open
question, which we will study in future work, is how to optimally use the closed
frequent itemsets belonging to an overlap class. Integrating the reduction of
sets of binary clauses into the general Mining4SAT approach is also an
interesting research perspective.